Impossibility of successful classification when useful features are rare and weak.

Author

  • Jiashun Jin
Abstract

We study a two-class classification problem with a large number of features, out of which many are useless and only a few are useful, but we do not know which ones they are. The number of features is large compared with the number of training observations. Calibrating the model with four key parameters--the number of features, the size of the training sample, and the fraction and strength of useful features--we identify a region in parameter space where no trained classifier can reliably separate the two classes on fresh data. The complement of this region--where successful classification is possible--is also briefly discussed.
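The rare/weak calibration can be illustrated with a small simulation. The parameter names (p, n, eps, tau) and the Gaussian mean-shift model below are illustrative assumptions for a sketch, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative rare/weak parameters (assumed names, not from the paper):
p = 10_000   # number of features
n = 40       # training sample size (n much smaller than p)
eps = 0.01   # fraction of useful features
tau = 0.8    # strength (mean shift) of each useful feature

# Randomly choose which features are useful; in the paper's setting
# their identities are unknown to the classifier.
useful = rng.random(p) < eps

# Class labels are +/-1; each useful feature's mean shifts with the label.
y = rng.choice([-1, 1], size=n)
mu = np.where(useful, tau, 0.0)
X = rng.standard_normal((n, p)) + 0.5 * np.outer(y, mu)

print(useful.sum(), "useful features out of", p)
```

With eps small and tau modest, most columns of X are pure noise, which is what makes training a reliable classifier difficult in the impossibility region.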


Similar articles

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered the current state-of-the-art models for this task. However, these classifiers have two major drawbacks: the huge computational power demand for...


Feature Selection by Higher Criticism Thresholding: Optimal Phase Diagram

We consider two-class linear classification in a high-dimensional, low-sample size setting. Only a small fraction of the features are useful, the useful features are unknown to us, and each useful feature contributes weakly to the classification decision – this setting was called the rare/weak model (RW Model) in [11]. We select features by thresholding feature z-scores. The threshold is set by...
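The z-score thresholding described above can be sketched as follows. This is a minimal illustration of Higher Criticism thresholding in its usual formulation (maximize the HC statistic over the smaller p-values and threshold at the maximizing z-score); the function name and the restriction to the smallest half of the p-values are illustrative choices, not necessarily the exact variant in the paper:

```python
import numpy as np
from math import erfc, sqrt

def hc_threshold(z):
    """Pick a |z| threshold by maximizing the Higher Criticism statistic.

    Two-sided normal p-values are computed and sorted; for sorted
    p-values pi_(1) <= ... <= pi_(p),
        HC(i) = sqrt(p) * (i/p - pi_(i)) / sqrt(pi_(i) * (1 - pi_(i)))
    is maximized over the smallest half of the p-values, and the
    threshold is the |z| of the maximizing feature.
    """
    p = len(z)
    pvals = np.array([erfc(abs(v) / sqrt(2)) for v in z])  # two-sided p-values
    order = np.argsort(pvals)
    pi = pvals[order]
    i = np.arange(1, p + 1)
    hc = sqrt(p) * (i / p - pi) / np.sqrt(pi * (1 - pi))
    k = int(np.nanargmax(hc[: p // 2]))  # search the smaller p-values only
    return abs(z[order[k]])

# Mostly-null z-scores plus a few strong ones; features whose |z|
# exceeds the HC threshold are selected.
z = np.concatenate([np.random.default_rng(1).standard_normal(990),
                    np.full(10, 4.0)])
t = hc_threshold(z)
selected = np.abs(z) >= t
```

The appeal of this rule in the rare/weak setting is that the threshold adapts to the unknown fraction and strength of the useful features rather than being fixed in advance.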


Higher Criticism Thresholding: Optimal Feature Selection when Useful Features are Rare and Weak

Motivated by many ambitious modern applications – genomics and proteomics are examples – we consider two-class linear classification in a high-dimensional, low-sample-size setting (a.k.a. p ≫ n). We consider the case where, among a large number of features (dimensions), only a small fraction of them is useful. The useful features are unknown to us, and each of them contributes weakly to the classif...


Feature selection by higher criticism thresholding achieves the optimal phase diagram.

We consider two-class linear classification in a high-dimensional, small-sample-size setting. Only a small fraction of the features are useful, these being unknown to us, and each useful feature contributes weakly to the classification decision. This was called the rare/weak (RW) model in our previous study (Donoho, D. & Jin, J. 2008 Proc. Natl Acad. Sci. USA 105, 14790–14795). We select feat...


Optimal Classification in Sparse Gaussian Graphic Model

Consider a two-class classification problem when the number of features is much larger than the sample size. The features are masked by Gaussian noise with zero means and a covariance matrix Σ, where the precision matrix Ω = Σ⁻¹ is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e., rare and weak) to the classification decision. By ob...



Journal:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 106, Issue 22

Pages: –

Publication date: 2009